Possible Directions for Improving 
Dependency Versioning in R 



by Jeroen Ooms 

Abstract One of the most powerful features of 
R is its infrastructure for contributed code. The 
built-in package manager and complementary 
repositories provide a great system for develop- 
ment and exchange of code, and have played an 
important role in the growth of the platform to- 
wards the de-facto standard in statistical com- 
puting that it is today. However, the number of 
packages on CRAN and other repositories has 
increased beyond what might have been fore- 
seen, and is revealing some limitations of the 
current design. One such problem is the general 
lack of dependency versioning in the infrastruc- 
ture. This paper explores this problem in greater 
detail, and suggests approaches taken by other 
open source communities that might work for R 
as well. Three use cases are defined that exem- 
plify the issue, and illustrate how improving this 
aspect of package management could increase 
reliability while supporting further growth of 
the R community. 

Package management in R 

One of the most powerful features of R is its infras- 
tructure for contributed code (Fox, 2009). The base R 
software suite that is released several times per year 
ships with the base and recommended packages 
and provides a solid foundation for statistical com- 
puting. However, most R users will quickly resort 
to the package manager and install packages con- 
tributed by other users. By default, these packages 
are installed from the "Comprehensive R Archive 
Network" (CRAN), featuring over 4300 contributed 
packages as of 2013. In addition, other reposi- 
tories like BioConductor (Gentleman et al., 2004) 
and GitHub (Dabbish et al., 2012) are hosting a respectable 
number of packages as well. 

The R Core team has done a tremendous job in 
coordinating the development of the base software 
along with providing, supporting, and maintaining 
an infrastructure for contributed code. The system 
for sharing and installing contributed packages is 
easily taken for granted, but could in fact not sur- 
vive without the commitment and daily efforts from 
the repository maintainers. The process from sub- 
mission to publication of a package involves sev- 
eral manual steps that are needed to ensure that all 
published packages meet standards and work as ex- 
pected, on a variety of platforms, architectures and 
R versions. In spite of rapid growth and limited re- 
sources, CRAN has managed to maintain high stan- 
dards on the quality of packages. Before continu- 
ing, we want to express appreciation for the count- 
less hours invested by volunteers in organizing this 
unique market place for statistical software. They 
have facilitated development, innovation and collab- 
oration in our field, and united the community in cre- 
ating software that is both of the highest quality and 
publicly available. We want to emphasize that sug- 
gestions made in this paper are in no way intended 
as criticism of the status quo. If anything, we hope 
that our ideas help address some challenges to sup- 
port further growth without having to compromise 
on the open and dynamic nature of the infrastruc- 
ture. 

The dependency network 

Most R packages depend on one or more other pack- 
ages, resulting in a complex network of recursive de- 
pendencies. Each package includes a 'DESCRIPTION' 
file which allows for declaration of several types of 
dependencies, including Depends, Imports, Suggests 
and Enhances. Based on the type of dependency rela- 
tionship, other packages are automatically installed, 
loaded and/or attached with the requested package. 
Package management is also related to the issue 
of namespacing, because different packages can use 
identical names for objects. The 'NAMESPACE' file 
allows the developer to explicitly define objects to 
be exported or imported from other packages. This 
prevents the need to attach all dependencies and 
lookup variables at runtime, and thereby decreases 
chances of masking and naming-conflicts. Unfortu- 
nately, many packages are not taking advantage of 
this feature, and thereby force R to attach all depen- 
dencies, unnecessarily filling the search path of a ses- 
sion with packages that the user hasn't asked for. 
However, this is not the primary focus of this paper. 
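
As a rough illustration of the dependency fields described above, the following sketch parses a made-up 'DESCRIPTION' file with the base R function read.dcf. All package names (examplepkg, somepkg, otherpkg, testpkg) and the helper split_deps are hypothetical and only serve the example.

```r
# A made-up DESCRIPTION file; all package names are hypothetical.
desc_text <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Depends: R (>= 2.15.0), somepkg",
  "Imports: otherpkg (>= 1.2)",
  "Suggests: testpkg"
)

# read.dcf() parses the Debian-control-style format used by DESCRIPTION files.
con <- textConnection(desc_text)
desc <- read.dcf(con, fields = c("Package", "Depends", "Imports", "Suggests"))
close(con)

# Hypothetical helper: split a comma-separated dependency field into entries.
split_deps <- function(field) {
  if (is.na(field)) return(character(0))
  trimws(strsplit(field, ",", fixed = TRUE)[[1]])
}

depends <- split_deps(desc[1, "Depends"])
imports <- split_deps(desc[1, "Imports"])
```

Each entry keeps its optional version qualifier, so a tool built on this could inspect which dependencies are declared with a version and which by name only.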

Package versioning 

Even though CRAN consistently archives older ver- 
sions of every package when updates are published, 
the R software itself takes limited advantage of this 
archive. R identifies packages by name only, both 
when installing and loading a package. For example, 
the install.packages function downloads and 
installs the current version of a CRAN package into 
a single global library. The library contains only 
one version of each package. If a previous ver- 
sion of the package is already installed on the sys- 
tem, it is overwritten without warning. Similarly, 
the library function will load the earliest found package 
with a matching name. And when a package is 
loaded that requires a dependency, the most recent 
version of the dependency is again loaded from the 
global library. This default behavior is quite hard to 
avoid. One can try to manually create and maintain 
separate libraries for different tasks and projects, but 
there are two fundamental limitations of R that make 
this impractical. First, the default behavior of R discourages 
authors from explicitly requiring a specific version 
of a dependency, because in the situation of a 
single global library, there is only one version of each 
package available. Second, R cannot have multiple 
versions of the same package loaded simultaneously. 
This is perhaps the most fundamental problem be- 
cause it is nearly impossible to work around. 
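
To make the limitation concrete, the sketch below shows the kind of manual check a script currently has to perform if it wants to insist on a minimum version of a package. The helper require_version is hypothetical and not part of base R; it only relies on utils::packageVersion and package_version.

```r
# Check that an installed package satisfies a minimum version before use.
# `require_version` is a hypothetical helper, not part of base R.
require_version <- function(pkg, min_version, lib.loc = NULL) {
  installed <- utils::packageVersion(pkg, lib.loc = lib.loc)
  if (installed < package_version(min_version)) {
    stop(sprintf("%s %s is installed, but >= %s is required",
                 pkg, as.character(installed), min_version))
  }
  invisible(installed)
}

# 'stats' ships with R, so this check should succeed on any installation.
v <- require_version("stats", "2.0.0")
```

Note that such a check can only reject the version that happens to be installed; it cannot select a different one, which is exactly the gap discussed in this section.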

The fact that the current implementation of R 
identifies packages by their name only, implicitly 
makes the assumption that different versions of a 
package are interchangeable. This basic assumption 
has far-reaching implications and consequences on 
the distributed development process and reliability 
of the software as a whole. In the context of the in- 
creasingly large pool of inter-dependent packages, 
violations of this assumption are becoming increasingly 
apparent and problematic. In this paper we explore 
this problem in greater detail, and try to make a 
case for moving away from this assumption, towards 
systematic versioning of dependency relationships. 

The word dependency in this context does not ex- 
clusively refer to formally defined relations between 
R packages. Our interpretation is a bit more gen- 
eral in the sense that any R script, Sweave document, 
or third party application depends on R and certain 
packages that are needed to make it function. The 
paper is largely motivated by personal experiences, 
as we have come to believe that limitations of the 
current dependency system are underlying multiple 
problems that R users and developers are experienc- 
ing. Properly addressing this concern could resolve 
several lingering issues at once, and make R a more 
reliable and widely applicable analytical engine. 

Use cases 

A dependency defines a relationship wherein a cer- 
tain piece of software requires some other software 
to run or compile. However, software constantly 
evolves, and in the open source world this hap- 
pens largely unmanaged. Consequently, any soft- 
ware library might actually be something different 
today than it was yesterday. Hence, solely defining 
the dependency relationship in terms of the name 
of the software is often insufficient. We need to 
be more specific, and declare explicitly which version(s), 
branch(es) or release(s) of the other software 
package will make our program work. This is what 
we will refer to as dependency versioning. 



This problem is not at all unique to R; in fact a 
large share of this paper consists of taking a closer 
look at how other open source communities are man- 
aging this process, and if some of their solutions 
could apply to R as well. But first we will elabo- 
rate a bit further on how this problem exactly ap- 
pears in the context of R. This section describes three 
use cases that reveal some limitations of the current 
system. These use cases delineate the problem and 
lead towards suggestions for improvements in sub- 
sequent sections. 

(1) Archive / repository maintenance 

A medium to large sized repository with thousands 
of packages has a complicated network of dependen- 
cies between packages. CRAN is designed to con- 
sider the very latest version of every package as the 
only current version. This design relies on the (im- 
plicit) assumption that, at any given time, the latest 
versions of all packages are compatible. Therefore, 
R's built-in package manager can simply download 
and install the current versions of all dependencies 
along with the requested package, which seems con- 
venient. However, to developers this means that ev- 
ery package update needs to maintain full backward 
compatibility with all previous versions. No version 
can introduce any breaking changes, because other 
packages in the repository might be relying on things 
in a certain way. Functions or objects may never be 
removed or modified; names, arguments, behavior, 
etc, must remain the same. As the dependency net- 
work gets larger and more complex, this policy be- 
comes increasingly vulnerable. It puts a heavy bur- 
den on contributing developers, especially the pop- 
ular ones, and results in increasingly large packages 
that are never allowed to deprecate or clean up old 
code and functionality. 

In practice, the assumption is easily violated. Ev- 
ery time a package update is pushed to CRAN, there 
is a real chance of some reverse dependencies failing 
due to a breaking change. In the case of the most 
popular packages, the probability of this happening 
is often closer to 1 than to 0, regardless of the author. 
Uwe Ligges has stated in his keynote presentation at 
useR that CRAN automatically detects some of these 
problems by rebuilding every package up in the de- 
pendency tree. However, only a small fraction of po- 
tential problems reveal themselves during the build 
of a package, and when found, there is no clear solu- 
tion. One recent example was the forced roll-back 
of the ggplot2 (Wickham, 2009) update to version 
0.9.0, because the introduced changes caused several 
other packages to break. The maintainer of the gg- 
plot2 package has since been required to announce 
upcoming updates to maintainers of packages that 
depend on ggplot2, and provide a release candidate 
to test compatibility. The maintainers of the depen- 
dent packages are then required to synchronize their 
releases if any problems arise. However, such manual 
solutions are far from flawless and put even more 
work on the shoulders of contributing developers. It 
is doubtful that all package authors on CRAN have 
time and resources to engage in an extensive dia- 
logue with other maintainers for each update of a 
package. We feel strongly that a more systematic 
solution is needed to guarantee that software published 
on CRAN keeps working over time. 

With the repository reaching a certain size, and 
some packages having hundreds of reverse depen- 
dencies, we have little choice but to acknowledge 
the fact that every package has only been developed 
for, and tested with certain versions of its dependen- 
cies. A policy of assuming that any current or fu- 
ture version of a dependency will suffice is danger- 
ous and sets the wrong incentives for package au- 
thors. It discourages change, refactoring or cleanup, 
and results in packages accumulating an increasingly 
heavy body of legacy code. And as the repository 
grows, it is inevitable that packages will neverthe- 
less eventually break as part of the process. What 
is needed is a redesign that supports the continuous 
decentralized change of software and helps facilitate 
a more reliable development process. This is not im- 
possible: there are numerous open source commu- 
nities managing repositories with more complex de- 
pendency structures than CRAN. Although specifics 
vary, they form interesting role models to our com- 
munity. As we will see later on, a properly archived 
repository can actually become a great asset rather 
than a liability to the developer. 

(2) Reproducibility 

Reproducible research is an important topic in sci- 
ence and statistics. The CRAN Task View: Repro- 
ducible Research states that: 

The goal of reproducible research is to tie spe- 
cific instructions to data analysis and exper- 
imental data so that scholarship can be recre- 
ated, better understood and verified. 

In R, reproducible research is largely facilitated us- 
ing literate programming techniques implemented 
in packages like Sweave that mix (weave) R code 
with LaTeX-markup to create a "reproducible document" 
(Leisch, 2002). However, those ever faced 
with the task of actually reproducing such a docu- 
ment might have experienced that the Sweave file 
does not always compile out of the box. Especially if 
it was written several years ago and loads some con- 
tributed packages, chances are that essential things 
have changed in the software since the document 
was created. When we find ourselves in such a situ- 
ation, recovering the packages needed to reproduce 
the document might turn out to be non-trivial. 

An example: suppose we would like to reproduce 
a Sweave document which was created with R 2.13 
and loads the caret package. If no further instruc- 
tions are provided, this means that any of the ap- 
proximately 25 releases of caret in the life cycle of R 
2.13 (April 2011 to February 2012) could have been 
used, making reproducibility unlikely. Sometimes 
authors add comments in the code where the pack- 
age is loaded, stating that e.g. caret 4.78 was used. 
However, this information might also turn out to be 
insufficient: caret depends on 4 packages, and sug- 
gests another 59 packages, almost all of which have 
had numerous releases in the R 2.13 time frame. Consequently, 
caret 4.78 might not work anymore because 
of changes in these dependencies. We then need to 
do further investigation to figure out which versions 
of the dependency packages were current at the time 
of the caret 4.78 release. Instead, let's assume that 
the prescient researcher anticipated all of this, and 
saved the full output of sessionInfo() along with 
the Sweave document, directly after it was compiled. 
This output lists the version of each loaded package 
in the active R session. We could then go ahead and 
manually download and install R 2.13 along with all 
of the required packages from the archive. However, 
users on commercial operating systems might be 
up for another surprise: unlike source packages, bi- 
nary packages aren't fully archived. For example, 
the only binary builds available for R 2.13 are respec- 
tively caret 5.13 on Windows, and caret 5.14 on OSX. 
Most likely, they will face the task of rebuilding each 
of the required packages from source in an attempt 
to reconstruct the environment of the author. 
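
The practice of the "prescient researcher" can be sketched in a few lines. The helper session_versions below is hypothetical; it merely collects the package names reported by utils::sessionInfo() and looks up their versions, and a temporary file stands in for a file stored alongside the document.

```r
# Collect name and version of every package in the current session.
# `session_versions` is a hypothetical helper built on utils::sessionInfo().
session_versions <- function() {
  info <- utils::sessionInfo()
  pkgs <- c(info$basePkgs, names(info$otherPkgs), names(info$loadedOnly))
  versions <- vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
                     character(1))
  data.frame(package = pkgs, version = unname(versions),
             stringsAsFactors = FALSE)
}

snapshot <- session_versions()

# Store the snapshot next to the document (a temporary file is used here).
snapshot_file <- tempfile(fileext = ".csv")
write.csv(snapshot, snapshot_file, row.names = FALSE)
```

Such a record documents which versions were used, but it does not by itself make those versions installable again, as the discussion of archived binaries above shows.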

Needless to say, this situation is suboptimal. For 
manually compiling a single Sweave document we 
might be willing to make this effort, but it does not 
provide a solid foundation for systematic or auto- 
mated reproducible software practices. If R is to re- 
ally support reproducible research, it needs better 
conventions and/or native support that is both explicit 
and specific about contributed code. For an R 
script or Sweave document to stand the test of time, 
it should run out of the box on at least the same ver- 
sion of R that was used by the author. In this respect, 
R has higher requirements on versioning than open 
source software in general. Reproducible research 
does not just require a version that will make things 
work, but one that generates exactly the same output. 
Hence, in order to systematically reproduce results 
in R, package versions either need to be standardized, 
or explicitly expressed in the code. 

(3) Production applications 

R is no longer exclusively used by the local statisti- 
cian through an interactive console; it is increasingly 
powering systems, stacks and applications with embedded 
analytics and graphics. When R is part of, 
say, an application used in hospitals to create on-demand 
graphics from patient data, the underlying 
code needs to be stable, reliable, and redistributable. 
Within such an application, even a minor change in 
code or behavior can result in complete failure of 
the system and cannot easily be fixed or debugged. 
Therefore, when an application is put in production, 
software has to be completely frozen. 

An application that builds on R has been developed 
and tested with certain versions of the base software 
and R packages used by the application. In 
order to put this application in production, exactly 
these versions need to be installed and loaded by the 
application on production servers. Managing, dis- 
tributing and deploying production software with 
R is remarkably hard, due to limited native depen- 
dency versioning and the single global library de- 
sign. One might find out the hard way that an appli- 
cation that was working in one place doesn't work 
elsewhere, even though exactly the same operating 
system, version of R, and installation scripts were 
used. The problem of course is that the contributed 
packages constantly change. Problems become more 
complicated when a machine is hosting several ap- 
plications that were developed by different people 
using different packages or package versions. 

The default behavior of loading packages from a 
global library with bleeding edge versions is unsuit- 
able for building applications. Because the CRAN 
repository has no notion of stable branches, one man- 
ually needs to download and install the correct ver- 
sions of packages in a separate library for each ap- 
plication to avoid conflicts. This is quite tricky and 
hard to scale when hosting many applications. In 
practice, application developers might not even be 
aware of these pitfalls, and design their applications 
to rely on the default behavior of the package man- 
ager. They then find out the hard way that appli- 
cations start breaking down later on, because of up- 
stream changes or library conflicts with other appli- 
cations. 
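
The manual workaround of one library per application might look roughly as follows. The helper use_app_library and the directory layout are hypothetical, and somepkg is a placeholder name; only .libPaths and the lib and lib.loc arguments of install.packages and library are existing R facilities.

```r
# Create (if needed) and activate a private library for one application.
# `use_app_library` and the directory layout are hypothetical.
use_app_library <- function(app_dir) {
  lib <- file.path(app_dir, "library")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  # Prepend the private library so it is searched before the global one.
  .libPaths(c(lib, .libPaths()))
  invisible(lib)
}

app_lib <- use_app_library(file.path(tempdir(), "myapp"))

# Packages would then be installed into, and loaded from, that library, e.g.
#   install.packages("somepkg", lib = app_lib)
#   library(somepkg, lib.loc = app_lib)
```

This isolates applications from each other, but it still installs whatever version is current on CRAN at that moment; obtaining the exact versions that were tested remains a manual task.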



Solution 1: staged distributions 

The problem of managing bottom-up decentralized 
software development is not new; rather it is a typi- 
cal feature of the open source development process. 
The remainder of this paper will explore two solu- 
tions from other open source communities, and sug- 
gest how these might apply to R. The current sec- 
tion describes the more classic solution that relies on 
staged software distributions. 

A "software distribution" (also referred to as a 
"distribution" or a "distro") is a collection of software 
components built, assembled and configured so that 
it can be used essentially "as is" for its intended pur- 
pose. Maintainers of distributions do not develop 
software themselves; they collect software from var- 
ious sources, package it up and redistribute it as a 
system. Distributions introduce a formal release cy- 
cle on the continuously changing upstream develop- 
ments and maintainers of a distribution take respon- 
sibility for ensuring compatibility of different pack- 
ages within a certain release of the distribution. Soft- 
ware distributions are most commonly known in the 
context of free operating systems (BSD, Linux, etc). 
Staging and shipping software in a distribution has 
proven to scale well to very large code bases. For 
example, the popular Debian GNU/Linux distribution 
(after which R's package description format was 
modeled) features over 29000 packages with a large 
and complex dependency network. No single person 
is familiar with even a fraction of the code base that is 
hosted in this repository. Yet through well organized 
staging and testing, this distribution is known to be 
one of the most reliable operating systems today, and 
is the foundation for a large share of the global IT in- 
frastructure. 

The release cycle 

In a nutshell, a staged distribution release can be or- 
ganized as follows. At any time, package authors 
can upload new versions of packages to the devel 
pool, also known as the unstable branch. A release 
cycle starts with distribution maintainers announc- 
ing a code freeze date, several months in advance. 
At this point, package authors are notified to ensure 
that their packages in the unstable branch are up to 
date, fix bugs and resolve other problems. At the 
date of the code freeze, a copy (fork) of the unsta- 
ble repository is made, named and versioned, which 
goes into the testing phase. Software in this branch 
will then be subject to several iterations of intensive 
testing and bug fixing, sometimes accompanied by 
alpha or beta releases. However, the software versions 
in the testing branch will no longer receive any 
major updates that could potentially have side effects 
or break other packages. The goal is to converge 
to an increasingly stable set of software. When after 
several testing rounds the distribution maintainers 
are confident that all serious problems are fixed, 
the branch is tagged stable and released to the public. 
Software in a stable release will usually only receive 
minor non-breaking updates, like important compat- 
ibility fixes and security updates. For the next "major 
release" of any software, the user will have to wait 
for the next cycle of the distribution. As such, ev- 
eryone using a certain release of the distribution is 
using exactly the same versions of all programs and 
libraries on the system. This is convenient for both 
users and developers and gives distributions a key 
role in bringing decentralized open source develop- 
ment efforts together. 

R: downstream staging and repackaging 

The semi-annual releases of the r-base software suite 
can already be considered as a distribution of the 29 
base and recommended packages. However, a major 
difference is that in the case of R, the distribution is 
limited to software that has been centrally developed 
and released by the same group of people. Due to 
the lack of native support for dependency version- 
ing in R, several projects have introduced some form 
of downstream staging in order to create stable, redistributable 
collections of R software. This section lists 
some examples and explains why this is suboptimal. 
In the next section we will discuss what would be 
involved with extending the R release cycle to con- 
tributed packages. 

One way of staging R packages downstream 
is by including them in existing software distribu- 
tions. For example, Eddelbuettel and Blundell (2009) 
have wrapped some popular CRAN packages into 
deb packages for the Debian and Ubuntu systems. 
Thereby, pre-compiled binaries are shipped in the 
distribution along with the R base software putting 
version compatibility in the hands of the maintain- 
ers (among many other benefits). This works well, 
but requires a lot of effort and commitment from the 
package maintainer, which is why this has only been 
done for a small subset of the CRAN packages. Most 
distributions expect high standards on the quality of 
the software and package maintenance, which makes 
this approach hard to scale up to many more pack- 
ages. Furthermore, we are tied to the release cycle 
of the distribution, resulting in a somewhat arbitrary 
and perhaps unfortunate snapshot of CRAN pack- 
ages when the distribution freezes. Also, different 
distributions will have different policies on if, when 
and which packages they wish to ship with their sys- 
tem. 

Another approach is illustrated by domain- 
specific projects like BioConductor (genomic data) 
and REvolution R Enterprise (big data). Both these 
systems combine a fixed version of R with a custom 
library of frozen R packages. In the case of REvo- 
lution, the full library is included with the installer; 
for BioConductor they are provided through a ded- 
icated repository. In both cases, this effectively pre- 
vents the software from being altered unexpectedly 
by upstream changes. However, this also leads to a 
split in the community between users of R, BioCon- 
ductor, and REvolution Enterprise. Because of the 
differences in libraries, R code is not automatically 
portable between these systems, leading to fragmen- 
tation and duplication of efforts. E.g. BioConduc- 
tor seems to host many packages that could be more 
generally useful; yet they are unknown to most users 
of R. Furthermore, both projects only target a limited 
set of packages; they still rely on CRAN for the ma- 
jority of the contributed code. 

The goal of staging is to tie a fixed set of con- 
tributed packages to a certain release of R. If these 
decisions are passed down to local distributions or 
organizations, a multitude of conventions and repos- 
itories arise, and different groups of users will still be 
using different packages. This leads to fragmentation 
of the community by system or distribution chan- 
nel. Moreover, it is often hard to assess compatibility 
of third party packages, resulting in somewhat arbi- 
trary local decision making. It seems that the pack- 
age authors themselves are really in the best position 
to manage and control compatibility. This leads us to 
conclude that the only appropriate place to organize 
staging of R packages is further upstream. 

Branching and staging in CRAN itself 

Given that the community of R contributors evolves 
mainly around CRAN, the most desirable approach 
to organizing staging would be by integrating it with 
the publication process. Currently, CRAN is man- 
aged as what distributions would consider a devel- 
opment or unstable branch. It is the pool of bleed- 
ing edge packages, straight from the developers. The 
fact that R already has a semi-annual release cycle 
for the 29 base and recommended packages, would 
make it relatively straightforward to extend this cy- 
cle to CRAN packages. A snapshot of CRAN could 
be frozen along with every version of r-release, and 
new package updates would only be published to 
the r-devel branch. In practice, this could perhaps 
quite easily be implemented by creating a directory 
on CRAN for each release of R, containing symbolic 
links to the versions of the packages considered sta- 
ble for this release. In the case of binary packages 
for OSX and Windows, CRAN actually already has 
separate directories with builds for each release of R. 
However, currently these are not frozen but continuously 
updated. In a staged repository, newly submitted 
packages are only built for the current devel 
and testing branches; they should not affect previ- 
ous releases. Exceptions to this process could still be 
granted to authors that need to push an important 
update or bugfix within a stable branch, commonly 
referred to as backporting, but this should only hap- 
pen incidentally. 

To fully make the transition to a staged CRAN, 
the default behavior of the package manager must 
be modified to download packages from the stable 
branch of the current version of R, rather than the lat- 
est development release. As such, all users on a given 
version of R will be using the same version of each 
CRAN package, regardless of when it was installed. 
The user could still be given an option to try and 
install the development version from the unstable 
branch, for example by adding an additional parameter 
to install.packages named devel=TRUE. However, 
when installing an unstable package, it must be 
flagged, and the user must be warned that this ver- 
sion is not properly tested and might not be working 
as expected. Furthermore, when loading this pack- 
age a warning could be shown with the version num- 
ber so that it is also obvious from the output that re- 
sults were produced using a non-standard version of 
the contributed package. Finally, users that would 
always like to use the very latest versions of all packages, 
e.g. developers, could install the r-devel release 
of R. This version contains the latest commits 
by R Core and downloads packages from the devel 
branch on CRAN, but should not be used in production 
or reproducible research settings. 
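
A minimal sketch of how a package manager could select the repository for a staged CRAN is given below. Everything in it is an assumption made for illustration: the function staged_repo, the host cran.example.org, and the "stable/<version>" and "devel" directory names do not exist on CRAN. The devel argument mirrors the devel=TRUE parameter suggested above.

```r
# Pick a repository URL for a given version of R.
# The "stable/<version>" and "devel" paths are invented for illustration;
# CRAN does not provide them.
staged_repo <- function(base_url, devel = FALSE,
                        r_version = getRversion()) {
  if (devel) {
    warning("installing from the unstable branch; packages are not ",
            "tested against this release of R")
    return(paste(base_url, "devel", sep = "/"))
  }
  # Assumption: one stable branch per major.minor release, e.g. "2.15".
  release <- paste(unlist(r_version)[1:2], collapse = ".")
  paste(base_url, "stable", release, sep = "/")
}

repo <- staged_repo("https://cran.example.org",
                    r_version = package_version("2.15.2"))
# install.packages("somepkg", repos = repo)   # not run: the URL is fictional
```

Whether branches would be keyed by major.minor or by full patch release is a design choice left open here; the sketch merely shows that the version of R alone could determine the set of contributed packages.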

Organizational change 

Appropriate default behavior of the software is a key 
element to encourage adoption of conventions and 
standards in the community. But just as important 
is communication and coordination between reposi- 
tory maintainers and package authors. To make stag- 
ing work, package authors must be notified of up- 
coming deadlines, code freezes or currently broken 
packages. Everyone must realize that the package 
version that is current at the time of code freeze, will 
be used by the majority of users of the upcoming ver- 
sion of R. Updates to already released branches can 
only be granted in exceptional circumstances, and 
only if they are guaranteed to maintain full backward 
compatibility. The policies of the BioConductor 
project provide a good starting point and could 
be adapted to work for CRAN. 

Transitioning to a system of "stable" and "devel- 
opment" branches in CRAN, where the stable branch 
is conventional for regular users, could tremen- 
dously improve the reliability of the software. The 
version of the R software itself would automati- 
cally imply certain versions of contributed packages. 
Hence, all that is required to reproduce a Sweave 
document created several years ago is knowing which 
version of R was used to create the document. When 
deploying an application that depends on R 2.15.2 and 
various contributed packages, we can be sure that a 
year later the application can be deployed just as eas- 
ily, even though the authors of contributed packages 
used by the application might have decided to im- 
plement some breaking changes. And package up- 
dates that deprecate old functionality or might break 
other packages that depend on it, can be uploaded 
to the unstable branch without worries, as the stable 
branches will remain unchanged and users won't be 
affected. The authors of the dependent packages that 
broke due to the update can be warned and will have 
sufficient time to fix problems before the next stable 
release. 

Solution 2: modern package man- 
agement 

The previous section described the "classical" solu- 
tion of creating distributable sets of compatible, sta- 
ble software. This is a proven approach and has 
been adopted in some way or another by many open- 
source communities. However, one drawback of this 
approach might be that some additional coordina- 
tion is needed for every release. Another drawback 
might be that it makes the software a bit more con- 
servative, in the sense that regular users will gener- 
ally be using versions of packages that are at least a 
couple of months old. The current section describes 
a different approach to the problem that is used by, 
for example, the Javascript community. This method 
is both reliable and flexible, but would require 
some more fundamental changes in the R software to 
be implemented. 

Node.js and NPM 

One of the most recent and fastest growing open 
source communities is that of the node.js software 
(for short: node), a Javascript server system based 
on the open source engine V8 from Google. One 
of the reasons that the community has been able 
to grow rapidly is because of the excellent package 
manager and identically named repository, NPM. 
Even though this package manager is only 3 years 
old, it is currently hosting over 23000 packages with 
more than a million downloads every day, and has 
quickly become the standard way of distributing 
Javascript code. The NPM package manager is a 
powerful tool for development, publication and de- 
ployment of both libraries and applications. NPM 
addresses some problems that Javascript and R actu- 
ally have in common, and makes an interesting role 
model for a modern solution to the problem. 

Development in the Javascript community can be described
as decentralized, unorganized and highly fragmented,
without any quality control authority.
Similar to CRAN, NPM basically allows anyone to 
claim a "package name" and start publishing pack- 
ages and updates to the repositories. The reposi- 
tory has no notion of branches and simply stores ev- 
ery version of a package indefinitely in its archives. 
However, a major difference with R is how the 
package manager handles installation, loading and 
namespacing of packages. 

Dependencies in NPM 

Every NPM package ships with a file named 
'package.json', which is the equivalent of the 
'DESCRIPTION' in R packages, yet a bit more ad- 
vanced. An overview of the full feature set of the 
package manager is beyond the scope of this paper, 
but the interested reader is highly encouraged to take 
a look over the fence at this well designed system: 
https://npmjs.org/doc/json.html. The most relevant
feature in the context of CRAN is how NPM
declares and resolves dependencies.

Package dependencies are defined using a com- 
bination of the package name and version range de- 
scriptor. This descriptor is specified using a simple 
dedicated syntax, that extends some of the standard 
versioning notation. Below is a snippet of a
'package.json' file, taken from the NPM manual:

"dependencies" : {
    "foo" : "1.0.0 - 2.9999.9999",
    "bar" : ">=1.0.2 <2.1.2",
    "baz" : ">1.0.2 <=2.3.4",
    "boo" : "2.0.1",
    "qux" : "<1.0.0 || >=2.3.1 <2.4.5",
    "asd" : "http://asdf.com/asdf.tar.gz",
    "til" : "~1.2",
    "elf" : "~1.2.3",
    "two" : "2.x",
    "thr" : "3.3.x"
}
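
To give an impression of how such descriptors could be evaluated in R, the sketch below checks a version against the simplest forms in the snippet above: an exact version, or a space-separated list of comparators such as ">=1.0.2 <2.1.2". The function name satisfies is hypothetical, and hyphen ranges, "||", "~" and "x" wildcards are deliberately left out; this is an illustration under those assumptions, not a port of the NPM implementation.

```r
# Hypothetical helper: does `version` satisfy every comparator in `range`?
# Supported comparators: >=, <=, >, <, = (a bare version means "=").
satisfies <- function(version, range) {
  v <- numeric_version(version)
  comparators <- strsplit(trimws(range), "[[:space:]]+")[[1]]
  for (cmp in comparators) {
    op <- sub("^([<>=]*).*$", "\\1", cmp)          # leading operator, if any
    bound <- numeric_version(sub("^[<>=]*", "", cmp))
    if (op == "") op <- "="
    ok <- switch(op,
      ">=" = v >= bound,
      "<=" = v <= bound,
      ">"  = v >  bound,
      "<"  = v <  bound,
      "="  = v == bound,
      stop("unsupported comparator: ", cmp)
    )
    if (!ok) return(FALSE)
  }
  TRUE
}

satisfies("1.5.0", ">=1.0.2 <2.1.2")   # TRUE
satisfies("2.1.2", ">=1.0.2 <2.1.2")   # FALSE, upper bound is exclusive
satisfies("2.0.1", "2.0.1")            # TRUE, exact match
```

The comparison itself is delegated to numeric_version, which ships with base R and already orders versions component by component.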

The version range descriptor syntax is a powerful 
tool to specify which version(s) or version range(s) 
of dependencies are required. It provides the exact 
information needed to build, install and /or load the 
software. In contrast to R, NPM takes full advantage 
of this information. In R, all packages are installed in 
one or more global libraries, and at any given time a 
subset of these packages is loaded into memory. This 
is where NPM takes a very different approach. Dur- 
ing installation of a package, NPM creates a subdirec- 
tory for dependencies inside the installation directory 
of the package. It compares the list of dependency 
declarations from the 'package.json' with an index of 
the repository archive, and then constructs a private 
library containing the full dependency tree and pre- 
cise versions as specified by the author. Hence, every 
installed package has its own library of dependen- 
cies. This works recursively, i.e. every dependency 
package inside the library again has its own depen- 
dency library. 

jeroen@ubuntu:~/Desktop$ npm install d3
jeroen@ubuntu:~/Desktop$ npm list
/home/jeroen/Desktop
└─┬ d3@2.10.3
  ├─┬ jsdom@0.2.14
  │ ├─┬ contextify@0.1.3
  │ │ └── bindings@1.0.0
  │ ├── cssom@0.2.5
  │ ├── htmlparser@1.7.6
  │ └─┬ request@2.12.0
  │   ├─┬ form-data@0.0.3
  │   │ ├── async@0.1.9
  │   │ └─┬ combined-stream@0.0.3
  │   │   └── delayed-stream@0.0.5
  │   └── mime@1.2.7
  └── sizzle@1.1.0

By default, a package loads dependencies from its
private library, and the namespace of the dependency
is imported explicitly in the code. This way, an
installed NPM package is completely unaffected by 
other applications, packages, and package updates 
being installed on the machine. The private library 
of any package contains all required dependencies, 
with the exact versions that were used to develop 
the package. A package or application that has been 
tested to work with certain versions of its dependen- 
cies, can easily be installed a year later on another 
machine, even though the latest versions of dependencies
have had major changes in the meantime.
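
The recursive layout of private libraries described above can be mimicked in a few lines of R. The sketch below is purely illustrative: the index is a toy subset of the d3 dependency tree, the function names private_library and install_paths are hypothetical, and "node_modules" is the subdirectory name that NPM uses for the private library of a package.

```r
# Toy repository index: each package maps to its direct dependencies.
index <- list(
  d3         = c("jsdom", "sizzle"),
  jsdom      = c("contextify", "cssom"),
  contextify = "bindings",
  bindings   = character(0),
  cssom      = character(0),
  sizzle     = character(0)
)

# Build the private library of `pkg`: a named list with one entry per
# dependency, each entry holding that dependency's own private library.
private_library <- function(pkg, index) {
  deps <- index[[pkg]]
  stats::setNames(lapply(deps, private_library, index = index), deps)
}

# Derive the installation paths implied by the nesting.
install_paths <- function(tree, root) {
  paths <- character(0)
  for (dep in names(tree)) {
    path <- file.path(root, "node_modules", dep)
    paths <- c(paths, path, install_paths(tree[[dep]], path))
  }
  paths
}

tree <- private_library("d3", index)
paths <- install_paths(tree, "d3")
print(paths)
```

Because every dependency lives below the package that declared it, two packages can hold different versions of the same dependency without ever sharing a directory.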

Back to R 

A similar way of managing packages could be very 
beneficial to R as well. It would enable the same dy- 
namic development and stable installation of pack- 
ages that has resulted in a small revolution within 
the Javascript community. The only serious draw- 
back of this approach is that it requires more disk 
space and slightly more memory, but most users and 
developers will happily pay this small price for reli- 
able software and reduced debugging time. Unfortu- 
nately, implementing a package manager like NPM 
for R would require some fundamental changes in 
the way R installs and loads packages and names- 
paces, which might break backward compatibility at 
this point. One change that would probably be re- 
quired for this is to move away from the Depends re- 
lation definition, and require all packages to rely on 
Imports and a NAMESPACE file to explicitly import ob- 
jects from other packages. A more challenging prob- 
lem might be that R should be able to load multiple 
versions of a package simultaneously while keeping 
their namespaces separated. This is necessary for ex- 
ample when two different packages are in use, which 
both depend on different versions of one and the 
same third package. In this case, the objects, methods
and classes exported by the dependency package
should be visible only to the package that imported
them.
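
R already offers the building blocks for such explicit imports. As a small illustration, the sketch below loads a namespace without attaching it to the search path and then refers to one of its exports by a namespace-qualified name; the 'tools' package is chosen only because it ships with R.

```r
# Load the namespace without attaching the package to the search path.
loadNamespace("tools")

# Namespace-qualified call: the export is reachable although the caller
# never ran library(tools).
ext <- tools::file_ext("report.Rnw")
ext   # "Rnw"
```

Inside a package, the same effect is obtained by listing the dependency under Imports and adding an importFrom directive to the NAMESPACE file.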

Finally, it would be great if the package manager 
was capable of installing multiple versions of a package
inside the global library, for example by appending
the package version to the name of the installation
directory (e.g. MASS_7.3-22). The library and
require functions could then be extended with an 
argument specifying the version to be loaded. This 
argument could use the same version range descrip- 
tor syntax that packages use to declare dependencies. 
Missing versions could automatically be installed, as 
nothing gets overwritten. 

library(ggplot2, version="0.8.9")
library(MASS, version="7.3-x")
library(Matrix, version=">=1.0")

Code as above leaves little ambiguity and 
tremendously increases reliability and reproducibil- 
ity of R code. When the author is explicit about 
which package versions are used, and packages are
explicit about dependency versions, an R script or 
Sweave document that once worked on a certain 
version of R, will work for other users, on differ- 
ent systems, and keep working over time, regard- 
less of upstream changes. For users not concerned 
with dependency versioning, the default value of the 
version argument could be set to "*". This value in- 
dicates that any version will do, in which case the 
package manager gives preference to the most recent 
available version of the package. 
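
How a loader might pick among several installed versions can be sketched as follows. Everything here is an assumption made for illustration: the function name select_version, the directory naming scheme "name_version" suggested above, and the restriction to two descriptor forms ("*" and ">=x.y"). The version argument of library shown earlier does not exist in R.

```r
# Hypothetical helper: choose the installation directory of `pkg` among
# versioned directory names such as "MASS_7.3-22".
# `version` is either "*" (any version) or a lower bound such as ">=1.0".
select_version <- function(pkg, dirs, version = "*") {
  prefix <- paste0(pkg, "_")
  candidates <- dirs[startsWith(dirs, prefix)]
  if (length(candidates) > 0 && version != "*") {
    bound <- numeric_version(sub("^>=", "", version))
    found <- numeric_version(substring(candidates, nchar(prefix) + 1))
    candidates <- candidates[found >= bound]
  }
  if (length(candidates) == 0) {
    stop("no suitable version of ", pkg, " is installed")
  }
  # Prefer the most recent version among the remaining candidates.
  found <- numeric_version(substring(candidates, nchar(prefix) + 1))
  candidates[which(found == max(found))[1]]
}

installed <- c("MASS_7.3-21", "MASS_7.3-22", "Matrix_0.9-5", "Matrix_1.0-6")
select_version("MASS", installed)                      # "MASS_7.3-22"
select_version("Matrix", installed, version = ">=1.0") # "Matrix_1.0-6"
```

In a real implementation the error branch is where a missing version would be installed automatically, since nothing in the library gets overwritten.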

The benefits of a package manager capable of im- 
porting specific versions of packages would not just 
be limited to contributed code. Once such a pack- 
age manager is in place, this would reduce the ne- 
cessity to include all of the standard library in the 
R releases. The R Core team could consider mov- 
ing some of the base and recommended packages out 
of the r-base distribution, and offer them exclusively 
through CRAN. This way, the R software could even- 
tually become the minimal core containing only the 
language interpreter and package manager, similar 
to e.g. Node and NPM. More high-level function- 
ality could be loaded on demand as versioning is 
controlled by the package manager. This would al- 
low for less frequent releases of the R software itself, 
and further improve compatibility and reproducibil- 
ity between versions of R. 

Summary 

The infrastructure for contributed code has sup- 
ported the steady growth and adoption of the R soft- 
ware. For the majority of users, contributed code is 
just as essential in their daily work as the R base soft- 
ware suite. But the number of packages on CRAN 
has grown beyond what could have been foreseen, 
and practices and policies that used to work on a 
smaller scale are becoming unsustainable. At the 
same time there is an increasing demand for more re- 
liable, stable software, that can be used as part of em- 
bedded systems, enterprise applications, or repro- 
ducible research. The design and policies of CRAN 
and the package manager shape the development 
process and play an important role in determining 
the future of the platform. The current practice of 
publishing package updates directly to end-users fa- 
cilitates a highly versatile development, but comes at 
the cost of reliability. The default behavior of R to in- 
stall packages in a single library with only the latest 
versions is perhaps more appropriate for developers 
than regular users. After nearly two decades of de- 
velopment, R has reached a maturity where a slightly 
more conservative approach could be beneficial. 

This paper explained the problem of dependency 
versioning, and tried to make a case for transitioning 
to a system that does not assume that package ver- 
sions are interchangeable. The most straightforward 
approach would be to extend the r-release and
r-devel branches to the full CRAN repository, and
only publish updates of contributed packages to the 
r-devel branch of R. This way, the released versions 
of R are tied to a fixed version of every CRAN pack- 
age, which makes the code base and behavior of a 
given release of R less ambiguous. Furthermore, it
allows the coordination and testing of contributed
packages to be concentrated around releases of R, rather
than spread continuously throughout the year.

In the long term, a more fundamental revision of 
the packaging system could be considered, in order 
to facilitate dynamic contributed development without
sacrificing reliability. This would involve
major changes in the way libraries and namespaces
are managed. But when the time is right
to make the jump to the next major release of R, 
we hope that R Core will consider revising this im- 
portant part of the software, adopting modern ap- 
proaches and best practices of package management 
that are powering collaboration and uniting efforts 
within other open source communities. 